Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB of data uncompressed

Kaggle API setup

Kaggle is a data science competition platform that shares many datasets. In the past, it was troublesome to submit your results, as you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; it took me less than 15 minutes to complete a submission.

  1. Install the Kaggle library

For more detailed information on setting up the Kaggle API, see here and here.
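A minimal sketch of the setup, assuming a standard pip install and the default `~/.kaggle` token location (paths may differ on your system):

```shell
# Install the Kaggle CLI
pip install kaggle

# Place your API token (downloaded from kaggle.com -> Account -> Create New API Token)
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

# Sanity check: list the competition files
kaggle competitions files home-credit-default-risk
```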

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.

The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).


Data files overview

There are 7 different sources of data:

image.png

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file into DATA_DIR
  2. If you plan to use the Kaggle API, please use the following steps.
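The API route can be sketched as follows (assuming the Kaggle CLI is installed and the API token is configured):

```shell
# Download the competition data into DATA_DIR and unzip it
DATA_DIR="../../../Data/home-credit-default-risk"
mkdir -p "$DATA_DIR"
kaggle competitions download -c home-credit-default-risk -p "$DATA_DIR"
unzip -o "$DATA_DIR/home-credit-default-risk.zip" -d "$DATA_DIR"
```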

Imports

Data files overview

Data Dictionary

As part of the data download comes a data dictionary. It is named HomeCredit_columns_description.csv.

image.png

Application train

Application test

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets

Exploratory Data Analysis

Descriptive statistics

● A data dictionary of the raw features.
● Pandas profiling in jupyter notebook.
● We performed descriptive analysis on the dataset, covering the data type of each feature, the dataset size (307511 rows × 122 columns), and summary statistics (number of observations, mean, standard deviation, maximum, minimum, and quartiles) for all features. The data split is as follows: Train 70%, Test 20%, Validation 10%.
● We generated charts on descriptive statistics of the target dataset.
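These statistics come straight out of pandas; a minimal sketch on a toy frame (the column name is illustrative, the real frame is 307511 rows × 122 columns):

```python
import pandas as pd

# Toy stand-in for application_train
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, 67500.0, 135000.0],
    "TARGET": [1, 0, 0, 0],
})

# Data types, shape, and summary statistics (count, mean, std, min, quartiles, max)
print(df.dtypes)
print(df.shape)
stats = df["AMT_INCOME_TOTAL"].describe()
print(stats)
```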

Summary of Application train

Missing data for application train

Distribution of the target column

Correlation with the target column

Applicants Age

Applicants occupations

Target Vs borrowers based on gender

Males are more likely to default than females, based on the percentage of defaulter_count (second graph)

Gender Vs Income based on Target

Own House count based Target

Not a significant difference, but borrowers who own a house are slightly more likely to repay

Own car count based Target

Borrowers owning a car are more likely to pay on time

Occupation type count based on Target

Occupation type vs income based on Target

The defaulter percentage is lower when the IC_ratio is either low or high

Repayers to Applicants Ratio

Correlation of the positive days since birth and target

Correlation of the positive days since employment and target

Fetching important relevant features

Pandas profiling (contains correlation graphs between features)

Dataset questions

Unique record for each SK_ID_CURR

previous applications for the submission file

The persons in the Kaggle submission file have had previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID

Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 10 previous applications (22%)
* Medium = 10 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)
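This binning can be sketched with pandas, assuming contiguous bands of fewer than 10, 10 to 39, and 40 or more previous applications:

```python
import pandas as pd

# Number of previous applications per current application (toy values)
prev_app_counts = pd.Series([1, 4, 12, 38, 40, 73])

# Bin into low / medium / high; right bin edges are inclusive by default
bands = pd.cut(
    prev_app_counts,
    bins=[0, 9, 39, float("inf")],
    labels=["low", "medium", "high"],
)
print(bands.tolist())
```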

Joining secondary tables with the primary table

In the HCDR competition (and many other machine learning problems that involve multiple tables, in 3NF or not) we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table yields many new features about each loan application; these features tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?

Joining previous_application with application_x

We refer to the application_train data (and the application_test data) as the primary table and the other files as secondary tables (e.g., the previous_application dataset). The secondary tables can be joined to the primary table using the key SK_ID_CURR (with SK_ID_PREV and SK_ID_BUREAU linking the lower-level tables to previous_application and bureau, respectively).

Let's assume we wish to generate a feature based on previous application attempts. Possible features here could be aggregates over a client's previous applications, such as the number of previous applications or the mean requested amount.

To build such features, we need to join the application_train data (and the application_test data) with the previous_application dataset (and the other available datasets).
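A minimal sketch of this join with pandas on toy frames (column names other than the keys are illustrative): aggregate previous_application to one row per client, then merge the result into the application table.

```python
import pandas as pd

# Toy primary table (application_train) and secondary table (previous_application)
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_APPLICATION": [1000.0, 3000.0, 500.0],
})

# Aggregate the secondary table to one row per client
prev_agg = prev.groupby("SK_ID_CURR").agg(
    prev_app_count=("AMT_APPLICATION", "size"),
    prev_amt_mean=("AMT_APPLICATION", "mean"),
).reset_index()

# Left-join so clients with no previous applications are kept (NaN aggregates)
merged = app.merge(prev_agg, on="SK_ID_CURR", how="left")
print(merged)
```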

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application data sets, thereby generating many new (derived) features, and then joining (aka merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]

I want you to think about this section and build on this.

Roadmap for secondary table processing

  1. Transform all the secondary tables into features that can be joined into the main table, the application table (labeled and unlabeled)
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'

agg detour

Aggregate using one or more operations over the specified axis.

For more details see agg

DataFrame.agg(func, axis=0, *args, **kwargs)
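A small sketch of DataFrame.agg in action on toy data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10.0, 20.0, 30.0, 40.0]})

# One operation per column...
result = df.agg({"A": "max", "B": "mean"})
print(result)  # A -> 4, B -> 25.0

# ...or several operations over every column
multi = df.agg(["min", "max"])
print(multi)
```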

Multiple condition expressions in Pandas

So far, our boolean selections have involved a single condition. You can, of course, combine as many conditions as you would like. To do so, you need to combine your boolean expressions using logical operators.

Although plain Python uses the keywords and, or, and not, these will not work when testing multiple conditions with pandas. The details of why are explained here.

You must use the following operators with pandas: & for and, | for or, and ~ for not.
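For example, on a toy frame (note that each condition must be parenthesized, since & and | bind more tightly than comparisons):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 60], "income": [30000, 80000, 50000]})

# & (and), | (or), ~ (not)
both = df[(df["age"] > 30) & (df["income"] > 60000)]
either = df[(df["age"] > 50) | (df["income"] > 60000)]
negated = df[~(df["age"] > 30)]
print(len(both), len(either), len(negated))
```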

Missing values in prevApps

feature engineering for prevApp table

feature transformer for prevApp table

Join the labeled dataset

Join the unlabeled dataset (i.e., the submission file)

Processing pipeline

OHE when previously unseen unique values appear in the test/validation set

Train, validation, and test sets (and the leakage problem we have mentioned previously):

Let's look at a small use case to see how to deal with this:

This last problem can be solved by using the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.

Here is an example of that in action:

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
               'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

# Notice handle_unknown="ignore" in OHE, which ignores values from the validation/test set that
# do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
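To isolate the handle_unknown='ignore' behavior outside the pipeline, here is a minimal sketch on toy categories:

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on training categories only
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["M"], ["F"]])

# "X" was never seen during fit: its row encodes as all zeros instead of raising
encoded = enc.transform([["F"], ["X"]]).toarray()
print(encoded)
```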

OHE case study: The breast cancer wisconsin dataset (classification)

Please see this blog for more details on OHE when the validation/test set has previously unseen unique values.

HCDR preprocessing

Our Baseline Model

To get a baseline, we use some of the features after preprocessing them through the pipeline. The baseline model is a logistic regression model.

Building Logistic Regression baseline pipeline

Loss function used (data loss and regularization parts) in LaTeX
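A sketch of the loss being minimized, assuming scikit-learn's default L2-regularized logistic regression (with labels $y_i \in \{0,1\}$, weights $\mathbf{w}$, bias $b$, sigmoid $\sigma(z) = 1/(1+e^{-z})$, and regularization strength $\lambda = 1/C$; scikit-learn uses an equivalent unaveraged scaling):

```latex
J(\mathbf{w}, b) \;=\; -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \sigma(\mathbf{w}^\top \mathbf{x}_i + b) \;+\; (1 - y_i)\log\big(1 - \sigma(\mathbf{w}^\top \mathbf{x}_i + b)\big) \Big] \;+\; \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert_2^2
```

The first term is the data loss (binary cross-entropy); the second is the regularization part.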

Submission 1

Improving the AUC

Submission 2

Approach 2

Submission 3

Kaggle submission via the command line API
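Assuming the predictions have been saved as submission.csv, the command-line submission looks like:

```shell
kaggle competitions submit -c home-credit-default-risk \
    -f submission.csv -m "baseline logistic regression"
```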

Screenshot of kaggle submission

Kaggle_submission.png

report submission

Click on this link

Write-up

Abstract

HomeCredit uses machine learning modeling to provide unsecured loans based on a user's credit history, repayment behaviors, and other data. Our main goal is to see how accurately we can predict a new applicant's ability to repay a loan. Credit history is a metric that reflects a user's reliability, based on factors such as the user's average/minimum/maximum balance, recorded bureau scores, salary, and repayment practices. We used the Kaggle datasets to perform exploratory data analysis, develop machine learning pipelines, and evaluate models across many evaluation metrics for a model to be deployed as part of this project. We built a baseline logistic regression pipeline in Phase-1 and tested it on the full dataset, on a balanced dataset of 50000 non-defaulters, and on a balanced dataset of 75000 non-defaulters. The test accuracy of the baseline pipeline turned out to be 91.9%; however, the AUC score is only 0.5. After balancing, the 50000 non-defaulters dataset yields a test accuracy of 71.4% and an AUC score of 0.62; the 75000 non-defaulters dataset yields 76% test accuracy and a 0.57 AUC score.

Project Description

Data Description

Task to be tackled:

Due to the size of this data collection, we conduct EDA at this phase to determine which characteristics of applicants are associated with a higher chance of default.

The most important task during this phase was Exploratory Data Analysis, but it was also important to figure out which parts of the data collection had missing values and to properly pre-process those variables using numerical encoding.

Workflow

Screen%20Shot%202022-04-12%20at%209.43.34%20PM.png

Feature Engineering and transformers

By using feature engineering we were able to increase the efficiency of the model by performing the following tasks:

1) StandardScaler(): Before applying machine learning algorithms to the dataset, we have to understand the magnitude of the key features, which matters for feature selection and for identifying independent and dependent variables. Scaling limits the wide range each feature can take and puts all features on a comparable footing for analysis and model preparation. We used the StandardScaler built-in to standardize the features by subtracting the mean and then scaling to unit variance, i.e., dividing all values by the standard deviation.
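The standardization step can be sketched in plain Python (StandardScaler uses the population standard deviation, as here):

```python
def standardize(values):
    """Subtract the mean, then divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / variance ** 0.5 for v in values]

scaled = standardize([10.0, 20.0, 30.0])
print(scaled)  # roughly [-1.2247, 0.0, 1.2247]
```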

2) One-hot encoding: we performed one-hot encoding with handle_unknown='ignore'. By default, the OneHotEncoder raises an error if an unknown categorical value is present during transform; when this parameter is set to 'ignore' and an unknown category is encountered, the resulting one-hot encoded columns for that feature are all zeros.

3) Handling missing values in the train dataset (application_train.csv)

Some features contain NA values, i.e., missing data. There are many ways to handle this type of data. The null count and percentage for the features of the train dataset are as follows:

Screen%20Shot%202022-04-12%20at%207.17.46%20PM.png

One Hot Encoding

Screen%20Shot%202022-04-12%20at%207.50.44%20PM.png

Handling Missing Values

Screen%20Shot%202022-04-12%20at%207.47.40%20PM.png

Pipelines

The goal here is to predict whether a customer who has approached Home Credit for a loan will default or not. Therefore, this is a supervised classification task, and the target variable is either 1 or 0, where 1 means the client had payment difficulties (defaulter) and 0 means all payments were made on time (non-defaulter).

● Logistic regression can be used as a baseline model along with feature selection techniques like RFE, PCA, and SelectKBest.
● A support vector machine is a non-probabilistic binary classifier. It maps training examples to points in space so as to maximize the width of the gap between the two categories. It can also use the kernel trick to perform non-linear classification, implicitly mapping the inputs into a high-dimensional feature space.
● Random forests (or random decision forests) are an ensemble learning method that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees.
● LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, which can be used for classification.
● Deep learning neural networks may be used to improve the prediction model's accuracy, but we would not be able to identify the attributes that determine whether or not a client is a defaulter. This would lead to compliance concerns, as we would need to provide the specific features that caused a loan to be rejected.

Machine Learning Pipeline Steps:

  1. Data Preprocessing
    a. Gather Kaggle's raw data.
    b. Perform exploratory data analysis on the dataset.
    c. Feature engineering for improving performance of machine learning model.
  2. Model Selection
    a. Develop and test various candidate models, such as logistic regression, decision trees, random forests, and SVMs.
    b. Based on the evaluation measures, select the best model.
    c. Use various evaluation metrics like "accuracy," "F1 Score," and "AUC."
  3. Prediction Generation
    a. Prepare the new data and extract the features as before.
    b. Once the winning model has been chosen, use it to make predictions on the new data.

Baseline Logistic Regression Pipeline -

Screen%20Shot%202022-04-12%20at%208.28.01%20PM.png

The above table reports the test accuracy and AUC of the logistic regression baseline. The AUC is 0.502546. The test accuracy is 91.9%.

Improving AUC

(a) 50000 non-defaulters balanced dataset
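The rebalancing step can be sketched with pandas: undersample the majority class (non-defaulters) to a fixed count, keeping all defaulters. The toy example uses a count of 4; the report uses 50000 and 75000.

```python
import pandas as pd

# Toy imbalanced frame: 8 non-defaulters (TARGET=0), 2 defaulters (TARGET=1)
df = pd.DataFrame({"TARGET": [0] * 8 + [1] * 2, "x": range(10)})

# Undersample the non-defaulters to a fixed size, keep all defaulters
n_majority = 4
balanced = pd.concat([
    df[df["TARGET"] == 0].sample(n=n_majority, random_state=0),
    df[df["TARGET"] == 1],
])
print(balanced["TARGET"].value_counts().to_dict())
```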

Test Dataset

Screen%20Shot%202022-04-12%20at%208.36.08%20PM.png

50000 non-defaulters Balanced dataset

Screen%20Shot%202022-04-12%20at%208.39.37%20PM.png

(b) 75000 non-defaulters Balanced Dataset

Screen%20Shot%202022-04-12%20at%208.43.30%20PM.png

Baseline logistic regression code:

Screen%20Shot%202022-04-12%20at%208.45.57%20PM.png

Experimental results and Discussions

Conclusion

The goal of the HCDR initiative is to forecast the ability of the financially underserved population to repay loans. This project is significant because both the lender and the borrower require well-established predictions. Home Credit can present loan offers with the highest amount and APR to its consumers in real time thanks to ML pipelines, which acquire data from data providers via APIs, run EDA, and fit it to the model to generate scores in microseconds. As a result, risk analysis becomes extremely important here, because the NPA (Non-Performing Asset) ratio is expected to stay below 5% in order to run a profitable firm.

Credit history is a measure of a user's credibility that is calculated using characteristics such as the user's average/minimum/maximum balance, Bureau scores reported, salary, and repayment patterns that may be analyzed using the user's past timely defaults/repayments. Other criteria such as location data, social media data, calling/SMS data, and so on are included in alternate data. We would use the datasets given by kaggle for exploratory data analysis, machine learning pipelines, and model evaluation across many evaluation metrics for a model to be deployed as a part of this project.

Our EDA reveals the characteristics of the target dataset. We compare gender differences, income differences, home and car ownership, and occupations across the target and non-target groups. Our findings show that women outnumber men both among borrowers overall and within the target group. Men earn more money than women. More people own houses than do not, and more people own cars than do not.

We present baseline logistic regression pipelines in Phase-1. We tested the baseline pipeline on the full dataset, on a balanced dataset of 50000 non-defaulters, and on a balanced dataset of 75000 non-defaulters. The test accuracy of the baseline pipeline is 91.9%, but the AUC is only 0.5. After rebalancing, the 50000 non-defaulters dataset gives a test accuracy of 71.4% and an AUC score of 0.62, and the 75000 non-defaulters dataset gives 76% test accuracy and a 0.57 AUC. As our findings show, rebalancing gains AUC but loses test accuracy. These results suggest that there is a tradeoff between AUC and test accuracy, and that practitioners should carefully decide which to improve.

We believe that we should explore alternative candidate models in the future because these are baseline models without feature engineering or hyper-parameter adjustment. That is, we would examine more feature engineering, hyper-parameter tuning, feature selection and importance, and ensemble approaches in Phase 2. These additional strategies should help to enhance the models' AUC and test accuracy.

Kaggle Submission

Our output dataset for the target has 45673 values of 0 and 3071 values of 1.

Kaggle_submission.png

References

Some of the material in this notebook has been adapted from here.

https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8

https://online.stat.psu.edu/stat857/node/216/

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools

Read the following: